OcrV1, Main, Exploration, bibRecord, 000D22

Document cleanup using page frame detection

Internal identifier: 000D22 (Main/Exploration); previous: 000D21; next: 000D23

Authors: Faisal Shafait [Germany]; Joost Van Beusekom [Germany]; Daniel Keysers [Germany]; Thomas M. Breuel [Germany]

Source :

International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2008.

RBID : Pascal:09-0009437

French descriptors

Pascal (Inist)
- Texte, Analyse documentaire, Reconnaissance caractère, Reconnaissance forme, Reconnaissance optique caractère, Représentation spatiale, Image tridimensionnelle, Géométrie algorithmique, Base de données, Recherche documentaire, Recherche image, Recherche information, Speckle, Structure document, Alignement, Présentation document, Perspective, Traitement document.
Wicri :
- topic : Base de données, Recherche documentaire.

English descriptors

KwdEn :
- Alignment, Character recognition, Computational geometry, Database, Document analysis, Document layout, Document processing, Document retrieval, Document structure, Image retrieval, Information retrieval, Optical character recognition, Pattern recognition, Perspective, Spatial representation, Speckle, Text, Tridimensional image.

Abstract

When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page content area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
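
The cleanup step mentioned in the abstract (removing characters outside the computed page frame) can be illustrated with a minimal sketch. It assumes the page frame has already been found; the geometric matching algorithm that finds it is not detailed in this record and is not reproduced here. The names `Box` and `filter_to_page_frame`, and the coordinates, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)


def filter_to_page_frame(chars, frame):
    """Keep only (text, box) pairs whose box lies entirely inside the frame."""
    return [(text, box) for text, box in chars if frame.contains(box)]


# Assumed inputs: a precomputed page frame and OCR character boxes.
frame = Box(100, 80, 900, 1200)
chars = [
    ("A", Box(120, 100, 140, 130)),  # inside the frame
    ("x", Box(20, 100, 40, 130)),    # marginal noise from the neighboring page
    ("B", Box(150, 100, 170, 130)),  # inside the frame
]

kept = filter_to_page_frame(chars, frame)
cleaned_text = "".join(text for text, _ in kept)
print(cleaned_text)
```

Requiring full containment is one possible design choice; a real system might instead keep characters whose boxes overlap the frame by some fraction.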

Affiliations:

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000246
to stream PascalFrancis, to step Curation: 000533
to stream PascalFrancis, to step Checkpoint: 000244
to stream Main, to step Merge: 000D34
to stream Main, to step Curation: 000D22

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Document cleanup using page frame detection</title>
<author><name sortKey="Shafait, Faisal" sort="Shafait, Faisal" uniqKey="Shafait F" first="Faisal" last="Shafait">Faisal Shafait</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Image Understanding and Pattern Recognition (IUPR) Research Group, German Research Center for Artificial Intelligence (DFKI)</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Van Beusekom, Joost" sort="Van Beusekom, Joost" uniqKey="Van Beusekom J" first="Joost" last="Van Beusekom">Joost Van Beusekom</name>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>Department of Computer Science, Technical University of Kaiserslautern</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Keysers, Daniel" sort="Keysers, Daniel" uniqKey="Keysers D" first="Daniel" last="Keysers">Daniel Keysers</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Image Understanding and Pattern Recognition (IUPR) Research Group, German Research Center for Artificial Intelligence (DFKI)</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Breuel, Thomas M" sort="Breuel, Thomas M" uniqKey="Breuel T" first="Thomas M." last="Breuel">Thomas M. Breuel</name>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>Department of Computer Science, Technical University of Kaiserslautern</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">09-0009437</idno>
<date when="2008">2008</date>
<idno type="stanalyst">PASCAL 09-0009437 INIST</idno>
<idno type="RBID">Pascal:09-0009437</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000246</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000533</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000244</idno>
<idno type="wicri:doubleKey">1433-2833:2008:Shafait F:document:cleanup:using</idno>
<idno type="wicri:Area/Main/Merge">000D34</idno>
<idno type="wicri:Area/Main/Curation">000D22</idno>
<idno type="wicri:Area/Main/Exploration">000D22</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Document cleanup using page frame detection</title>
<author><name sortKey="Shafait, Faisal" sort="Shafait, Faisal" uniqKey="Shafait F" first="Faisal" last="Shafait">Faisal Shafait</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Image Understanding and Pattern Recognition (IUPR) Research Group, German Research Center for Artificial Intelligence (DFKI)</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Van Beusekom, Joost" sort="Van Beusekom, Joost" uniqKey="Van Beusekom J" first="Joost" last="Van Beusekom">Joost Van Beusekom</name>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>Department of Computer Science, Technical University of Kaiserslautern</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Keysers, Daniel" sort="Keysers, Daniel" uniqKey="Keysers D" first="Daniel" last="Keysers">Daniel Keysers</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Image Understanding and Pattern Recognition (IUPR) Research Group, German Research Center for Artificial Intelligence (DFKI)</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Breuel, Thomas M" sort="Breuel, Thomas M" uniqKey="Breuel T" first="Thomas M." last="Breuel">Thomas M. Breuel</name>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>Department of Computer Science, Technical University of Kaiserslautern</s1>
<s2>67663 Kaiserslautern</s2>
<s3>DEU</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<placeName><region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Alignment</term>
<term>Character recognition</term>
<term>Computational geometry</term>
<term>Database</term>
<term>Document analysis</term>
<term>Document layout</term>
<term>Document processing</term>
<term>Document retrieval</term>
<term>Document structure</term>
<term>Image retrieval</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Perspective</term>
<term>Spatial representation</term>
<term>Speckle</term>
<term>Text</term>
<term>Tridimensional image</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Texte</term>
<term>Analyse documentaire</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance optique caractère</term>
<term>Représentation spatiale</term>
<term>Image tridimensionnelle</term>
<term>Géométrie algorithmique</term>
<term>Base de données</term>
<term>Recherche documentaire</term>
<term>Recherche image</term>
<term>Recherche information</term>
<term>Speckle</term>
<term>Structure document</term>
<term>Alignement</term>
<term>Présentation document</term>
<term>Perspective</term>
<term>Traitement document</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Base de données</term>
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page content area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.</div>
</front>
</TEI>
<affiliations><list><country><li>Allemagne</li>
</country>
<region><li>Rhénanie-Palatinat</li>
</region>
<settlement><li>Kaiserslautern</li>
</settlement>
</list>
<tree><country name="Allemagne"><region name="Rhénanie-Palatinat"><name sortKey="Shafait, Faisal" sort="Shafait, Faisal" uniqKey="Shafait F" first="Faisal" last="Shafait">Faisal Shafait</name>
</region>
<name sortKey="Breuel, Thomas M" sort="Breuel, Thomas M" uniqKey="Breuel T" first="Thomas M." last="Breuel">Thomas M. Breuel</name>
<name sortKey="Keysers, Daniel" sort="Keysers, Daniel" uniqKey="Keysers D" first="Daniel" last="Keysers">Daniel Keysers</name>
<name sortKey="Van Beusekom, Joost" sort="Van Beusekom, Joost" uniqKey="Van Beusekom J" first="Joost" last="Van Beusekom">Joost Van Beusekom</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D22 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000D22 | SxmlIndent | more

To link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:09-0009437
   |texte=   Document cleanup using page frame detection
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
